CytoAtlas

Pan-Disease Single-Cell Cytokine Activity Atlas
Report for: Peng Jiang, Ph.D. — CDSL, National Cancer Institute
Prepared by: Seongyong Park
Date: February 2026

Executive Summary

CytoAtlas is a comprehensive computational resource that maps cytokine and secreted protein signaling activity across 240 million human cells from six independent datasets spanning healthy donors, inflammatory diseases, cancers, drug perturbations, and spatial transcriptomics. The system uses linear ridge regression against experimentally derived signature matrices to infer activity — producing fully interpretable, conditional z-scores rather than black-box predictions.

240M Total Cells
6 Datasets
1,293 Signatures
8 Validation Atlases
262 API Endpoints
12 Web Pages

Key results:

  • 1,293 signatures (44 CytoSig + 1,249 SecAct), plus 178 derived LinCytoSig signatures, validated across 8 independent atlases
  • Spearman correlations reach ρ=0.6–0.9 for well-characterized cytokines (IL1B, TNFA, VEGFA, TGFB family)
  • Cross-atlas consistency demonstrates signatures generalize across CIMA, Inflammation Atlas, scAtlas, GTEx, and TCGA
  • LinCytoSig improves prediction for select immune cell types (Basophil, NK, DC: +0.18–0.21 Δρ)
  • SecAct achieves the highest correlations in bulk & organ-level analyses (median ρ=0.40 in GTEx/TCGA)

1. System Architecture and Design Rationale

1.1 Why This Architecture?

CytoAtlas was designed around three principles that distinguish it from typical bioinformatics databases:

Principle 1: Linear interpretability over complex models.
Ridge regression (L2-regularized linear regression) was chosen deliberately over methods like autoencoders, graph neural networks, or foundation models. The resulting activity z-scores are conditional on the specific genes in the signature matrix, meaning every prediction can be traced to a weighted combination of known gene responses.
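
The traceability claim can be made concrete with a minimal NumPy sketch of closed-form ridge regression. It is an illustration under stated assumptions, not the CytoAtlas pipeline itself: the function name `ridge_weights`, the toy matrix sizes, and the penalty `lam=1.0` are all hypothetical. Because the estimate is a fixed linear map of the expression vector, each activity coefficient decomposes exactly into per-gene contributions.

```python
import numpy as np

def ridge_weights(signature, lam):
    """Return W = (X^T X + lam*I)^-1 X^T, so that activity = W @ expression.

    signature: (n_genes, n_signatures) matrix X of known gene responses.
    """
    X = np.asarray(signature, dtype=float)
    gram = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T)  # shape (n_signatures, n_genes)

# Toy data (hypothetical): 200 genes, 3 signatures, one expression profile.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_beta = np.array([2.0, 0.0, -1.0])
y = X @ true_beta + 0.1 * rng.normal(size=200)

W = ridge_weights(X, lam=1.0)
beta = W @ y                       # ridge activity coefficients, shape (3,)

# Each coefficient is a weighted sum over genes, so it can be audited gene by gene.
contributions = W[0] * y           # per-gene contribution to signature 0
top_genes = np.argsort(-np.abs(contributions))[:5]

print("beta:", np.round(beta, 2))
print("genes contributing most to signature 0:", top_genes)
```

The per-gene contributions sum to the coefficient itself, which is what makes a prediction inspectable in the sense described above.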

Principle 2: Multi-level validation at every aggregation.
CytoAtlas validates at five levels: donor-level pseudobulk, donor × cell-type pseudobulk, single-cell, bulk RNA-seq (GTEx/TCGA), and bootstrap resampling with confidence intervals.
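
The bootstrap level can be sketched as a percentile confidence interval for a Spearman correlation. This is a minimal sketch that assumes donors are resampled with replacement. The helper names, the number of resamples, and the simulated data are hypothetical, and the source does not state which bootstrap variant CytoAtlas uses.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    values = np.asarray(values, dtype=float)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)
    return (last_rank - (counts - 1) / 2.0)[inverse]

def spearman(x, y):
    """Spearman rho computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

def bootstrap_spearman_ci(x, y, n_boot=500, alpha=0.05, seed=0):
    """Point estimate plus percentile bootstrap interval for Spearman rho."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample donors with replacement
        boot[b] = spearman(x[idx], y[idx])
    low, high = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return spearman(x, y), float(low), float(high)

# Simulated donors (hypothetical): activity tracks expression with noise.
rng = np.random.default_rng(1)
expression = rng.normal(size=60)
activity = expression + 0.5 * rng.normal(size=60)

rho, low, high = bootstrap_spearman_ci(expression, activity)
print(f"rho = {rho:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```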

Principle 3: Reproducibility through separation of concerns.

| Component | Technology | Purpose |
| --- | --- | --- |
| Pipeline | Python + CuPy (GPU) | Activity inference, 10–34x speedup |
| Storage | DuckDB (3 databases, 68 tables) | Columnar analytics, no server needed |
| API | FastAPI (262 endpoints) | RESTful data access, caching, auth |
| Frontend | React 19 + TypeScript | Interactive exploration (12 pages) |

1.2 Processing Scale

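| Dataset | Cells | Samples | Time | GPU |
| --- | --- | --- | --- | --- |
| CIMA | 6.5M | 421 donors | ~2h | A100 |
| Inflammation Atlas | 6.3M | 1,047 samples | ~2h | A100 |
| scAtlas | 6.4M | 781 donors | ~2h | A100 |
| parse_10M | 9.7M | 1,092 conditions | ~3h | A100 |
| Tahoe-100M | 100.6M | 14 plates | ~12h | A100 |
| SpatialCorpus-110M | ~110M | 251 datasets | ~12h | A100 |
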
Figure 1. CytoAtlas overview. (A) Cell counts across 6 datasets totaling 240M cells. (B) Three signature matrices. (C) Multi-level validation strategy.

2. Dataset Catalog

2.1 Datasets and Scale

| # | Dataset | Cells | Donors/Samples | Cell Types | Reference |
| --- | --- | --- | --- | --- | --- |
| 1 | CIMA | 6,484,974 | 421 donors | 27 L2 / 100+ L3 | J. Yin et al., Science, 2026 |
| 2 | Inflammation Atlas | 6,340,934 | 1,047 samples | 66+ | Jimenez-Gracia et al., Nature Medicine, 2026 |
| 3 | scAtlas | 6,440,926 | 781 donors | 100+ | Q. Shi et al., Nature, 2025 |
| 4 | parse_10M | 9,697,974 | 12 donors × 91 cytokines | 18 PBMC types | Oesinghaus et al., bioRxiv, 2026 |
| 5 | Tahoe-100M | ~100,600,000 | 50 cell lines × 95 drugs | 50 cell lines | Zhang et al., bioRxiv, 2026 |
| 6 | SpatialCorpus-110M | ~110,000,000 | 251 spatial datasets | Variable | Tejada-Lapuerta et al., Nature Methods, 2025 |

2.2 Disease and Condition Categories

Inflammation Atlas (20 diseases): RA, SLE, Sjogren's, PSA, Crohn's, UC, COVID-19, Sepsis, HIV, HBV, BRCA, CRC, HNSCC, NPC, COPD, Cirrhosis, MS, Asthma, Atopic Dermatitis

scAtlas: Normal (35+ organs) + Cancer (15+ types: LUAD, CRC, BRCA, LIHC, PAAD, KIRC, OV, SKCM, GBM, etc.)

parse_10M: 90 cytokines × 12 donors, an independent in vitro perturbation dataset for comparison. About 58% of the cytokines are produced in E. coli, with the remainder from insect (Sf21, 12%) and mammalian (CHO, NS0, HEK293, ~30%) expression systems. Because exogenous perturbagens may induce effects that differ from those of endogenously produced cytokines, parse_10M serves as an independent comparison rather than strict ground truth. CytoSig and SecAct have a potential advantage in this regard, as they infer relationships directly from physiologically relevant samples.

Tahoe-100M: 95 drugs across 50 cancer cell lines

SpatialCorpus: Visium, Xenium, MERFISH, MERSCOPE, CosMx, ISS, Slide-seq — 30+ tissue types

2.3 Signature Matrices

| Matrix | Targets | Construction | Reference |
| --- | --- | --- | --- |
| CytoSig | 44 cytokines | Median log2FC across all experimental bulk RNA-seq | Jiang et al., Nature Methods, 2021 |
| LinCytoSig | 178 (45 cell types × 1–13 cytokines) | Cell-type-stratified median from CytoSig database (methodology) | This work |
| SecAct | 1,249 secreted proteins | Median global Moran's I across 1,000 Visium datasets | Ru et al., Nature Methods, 2026 (in press) |

3. Scientific Value Proposition

3.1 What Makes CytoAtlas Different from Deep Learning Approaches?

Most single-cell analysis tools use complex models (VAEs, GNNs, transformers) that produce aggregated, non-linear representations difficult to interpret biologically. CytoAtlas takes the opposite approach:

| Property | CytoAtlas (Ridge Regression) | Typical DL Approach |
| --- | --- | --- |
| Model | Linear (z = Xβ + ε) | Non-linear (multi-layer NN) |
| Interpretability | Every gene's contribution is a coefficient | Feature importance approximated post-hoc |
| Conditionality | Activity conditional on specific gene set | Latent space mixes all features |
| Confidence | Permutation-based z-scores with CI | Often point estimates only |
| Generalization | Tested across 8 independent cohorts | Often held-out splits of same cohort |
| Bias | Transparent: limited by signature matrix genes | Hidden in architecture and training data |
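
A permutation-based z-score of the kind listed under Confidence can be sketched as follows. The sketch assumes the null distribution is built by shuffling the gene labels of the expression profile and re-fitting the ridge model. The source does not spell out the permutation scheme, so the function name, the permutation count, and the penalty value here are hypothetical.

```python
import numpy as np

def permutation_zscores(signature, expression, lam=1.0, n_perm=500, seed=0):
    """Ridge coefficients and their z-scores against a gene-shuffling null."""
    rng = np.random.default_rng(seed)
    X = np.asarray(signature, dtype=float)
    y = np.asarray(expression, dtype=float)

    # The ridge projection is fixed, so it is computed once and reused.
    W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    beta = W @ y

    # Null: break the gene-to-response pairing by permuting the expression vector.
    null = np.empty((n_perm, X.shape[1]))
    for i in range(n_perm):
        null[i] = W @ rng.permutation(y)

    z = (beta - null.mean(axis=0)) / null.std(axis=0, ddof=1)
    return beta, z

# Toy data (hypothetical): only the first of three signatures is truly active.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = 1.5 * X[:, 0] + rng.normal(size=300)

beta, z = permutation_zscores(X, y)
print("beta:", np.round(beta, 2))
print("z   :", np.round(z, 1))
```

In this toy case the active signature receives a large z-score while the two inactive ones stay close to zero.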

The key insight: CytoAtlas is not trying to replace DL-based tools. It provides an orthogonal, complementary signal that a human scientist can directly inspect. When CytoAtlas says "IFNG activity is elevated in CD8+ T cells from RA patients," you can verify this by checking the IFNG signature genes in those cells.

3.2 What Scientific Questions Does CytoAtlas Answer?

  1. Which cytokines are active in which cell types across diseases?
  2. Are cytokine activities consistent across independent cohorts?
  3. Does cell-type-specific biology matter for cytokine inference?
  4. Which secreted proteins beyond cytokines show validated activity?
  5. How do drugs alter cytokine activity in cancer cells?
  6. What is the spatial organization of cytokine signaling?
  7. Can we predict treatment response from cytokine activity?

3.3 Validation Philosophy

CytoAtlas validates against a simple but powerful principle: if CytoSig predicts high IFNG activity for a sample, that sample should have high IFNG gene expression. This expression-activity correlation is computed via Spearman rank correlation across donors/samples.

This is a conservative validation — it only captures signatures where the target gene itself is expressed. Signatures that act through downstream effectors would not be captured, meaning our validation underestimates true accuracy.
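
The expression-activity check can be sketched as a per-target Spearman correlation across donors. The function names and the five-donor toy matrices below are invented for illustration only and are not CytoAtlas results.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    values = np.asarray(values, dtype=float)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)
    return (last_rank - (counts - 1) / 2.0)[inverse]

def validate_targets(activity, expression, targets):
    """Spearman rho per target between predicted activity and target-gene expression.

    activity, expression: arrays of shape (n_targets, n_donors), rows aligned to targets.
    """
    rho = {}
    for i, name in enumerate(targets):
        ranks_a = average_ranks(activity[i])
        ranks_e = average_ranks(expression[i])
        rho[name] = float(np.corrcoef(ranks_a, ranks_e)[0, 1])
    return rho

# Toy values (hypothetical): five donors, two targets.
targets = ["IFNG", "IL1B"]
expression = np.array([[1, 2, 3, 4, 5],
                       [5, 1, 4, 2, 3]], dtype=float)
activity = np.array([[2, 1, 4, 3, 5],      # mostly concordant with expression
                     [1, 5, 2, 4, 3]], dtype=float)  # exactly reversed ordering

rho = validate_targets(activity, expression, targets)
print(rho)  # IFNG close to 0.8, IL1B close to -1.0
```

Because only ranks enter the calculation, any monotone rescaling of activity or expression leaves ρ unchanged.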


4. Validation Results

4.1 Overall Performance Summary

How “N Targets” is determined: A target is included in the validation for a given atlas only if (1) the target’s signature genes overlap sufficiently with the atlas gene expression matrix, and (2) the target gene itself is expressed in enough samples to compute a meaningful Spearman correlation. Targets whose gene is absent or not detected in a dataset are excluded.

Donor-only atlases (CIMA, Inflammation, GTEx, TCGA): N = number of unique targets with valid correlations. CytoSig defines 43 cytokines and SecAct defines 1,170 secreted proteins. The Inflammation Atlas (main/validation cohorts) retains only 33 of 43 CytoSig targets and 805 of 1,170 SecAct targets because 10 cytokine genes (BDNF, BMP4, CXCL12, GCSF, IFN1, IL13, IL17A, IL36, IL4, WNT3A) are not sufficiently expressed in these blood/PBMC samples. CIMA, GTEx, and similar multi-organ datasets retain nearly all targets (≥97%).

Donor-organ atlases (scAtlas Normal, scAtlas Cancer): N = target × organ pairs, because validation is stratified by organ/tissue context. For scAtlas Normal, each target is validated independently across 25 organs (Bladder, Blood, Breast, Colon, Heart, Kidney, Liver, Lung, etc.), yielding up to 43 × 25 = 1,075 CytoSig entries (actual: 1,013 after filtering) and 1,140 × 25 = 28,500 SecAct entries (actual: 27,154). For scAtlas Cancer, validation spans 7 tissue contexts (Tumor, Adjacent, Blood, Metastasis, Pleural Fluids, Pre-Lesion, All), yielding 43 × 7 = 301 CytoSig entries (actual: 295) and 1,140 × 7 = 7,980 SecAct entries (actual: 7,809). Some target-organ pairs are excluded when the target gene lacks sufficient expression in that organ.
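
The entry counts quoted above can be checked with a few lines of arithmetic. The figures come from the paragraph; the helper name and dictionary layout are hypothetical.

```python
def pair_budget(n_targets, n_contexts, n_valid):
    """Return (possible target x context pairs, fraction retained after filtering)."""
    possible = n_targets * n_contexts
    return possible, n_valid / possible

# (targets, contexts, valid entries) as stated in the text.
stated = {
    "scAtlas Normal / CytoSig": (43, 25, 1013),
    "scAtlas Normal / SecAct": (1140, 25, 27154),
    "scAtlas Cancer / CytoSig": (43, 7, 295),
    "scAtlas Cancer / SecAct": (1140, 7, 7809),
}

report = {name: pair_budget(*values) for name, values in stated.items()}
for name, (possible, retained) in report.items():
    print(f"{name}: {possible} possible pairs, {retained:.1%} retained")
```

Under these numbers, between roughly 94% and 98% of possible target-organ pairs survive the expression filter.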

Note on scAtlas duplicate entries: At finer aggregation levels (e.g., donor_organ_celltype1 vs donor_organ_celltype2), the same target can appear multiple times with different correlation values. This is expected — finer cell-type annotation changes the composition of each pseudobulk sample, yielding different expression-activity relationships. The summary table above uses the donor_organ level for scAtlas.

4.2 Correlation Distributions

Figure 2. Spearman ρ distributions across atlases for CytoSig (44 targets), SecAct (1,249 targets), and SecAct restricted to CytoSig-matched targets (22 shared targets). Donor-level pseudobulk.

Why does SecAct appear to underperform CytoSig in the Inflammation Atlas?

This is a composition effect, not a genuine performance gap. CytoSig tests only 43 curated, high-signal cytokines, while SecAct tests 1,249 secreted proteins — including many tissue-expressed targets (collagens, metalloproteinases, apolipoproteins, complement factors) with minimal expression variation in blood/PBMC samples. On the 22 matched targets shared between both methods, SecAct consistently outperforms CytoSig across all atlases (e.g., median ρ = 0.51 vs 0.32 in Inflammation Main).

The Inflammation Atlas is largely blood-derived, so many SecAct targets that perform well in multi-organ contexts (scAtlas, GTEx, TCGA) contribute near-zero or negative correlations here. In fact, 99 SecAct targets are negative only in inflammation but positive in all other atlases, reflecting tissue-specific expression limitations rather than inference failure. The “SecAct (CytoSig-matched)” boxplot above demonstrates the fair comparison on equal footing.
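
The composition effect can be illustrated with a small sketch that compares medians over all targets with medians over the shared targets only. Every ρ value below is invented for illustration and is not a CytoAtlas result, and the function name is hypothetical.

```python
from statistics import median

def compare_on_shared(rho_a, rho_b):
    """Median rho of two methods, over all of their targets and over shared targets only."""
    shared = sorted(set(rho_a) & set(rho_b))
    return {
        "n_shared": len(shared),
        "all": (median(rho_a.values()), median(rho_b.values())),
        "shared": (median(rho_a[t] for t in shared), median(rho_b[t] for t in shared)),
    }

# Invented toy values: method B covers extra targets with little signal in blood.
method_a = {"IFNG": 0.5, "IL1B": 0.6, "TNFA": 0.4}
method_b = {"IFNG": 0.6, "IL1B": 0.7, "COL1A1": -0.1, "APOA1": 0.0, "MMP2": -0.05}

summary = compare_on_shared(method_a, method_b)
print(summary)
```

In the toy data, method B has the lower median over all of its targets but the higher median once both methods are restricted to the same targets, which is the pattern the matched boxplot is designed to expose.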

4.3 Best and Worst Correlated Targets

Figure 3. Top 15 (best) and bottom 15 (worst) correlated targets.

Consistently well-correlated targets (ρ > 0.3 across multiple atlases):

Consistently poorly correlated targets (ρ < 0 in multiple atlases):

Gene mapping verified: All four targets are correctly mapped (CD40L→CD40LG, TRAIL→TNFSF10, LTA→LTA, HGF→HGF). No gene ID confusion exists. The poor correlations reflect specific molecular mechanisms:

| Target | Gene | Dominant Mechanism | Contributing Factors |
| --- | --- | --- | --- |
| CD40L | CD40LG | Platelet-derived sCD40L invisible to scRNA-seq (~95% of circulating CD40L); ADAM10-mediated membrane shedding | Unstable mRNA (3′-UTR destabilizing element); transient expression kinetics (peak 6–8h post-activation); paracrine disconnect (T cell → B cell/DC) |
| TRAIL | TNFSF10 | Three decoy receptors (DcR1/TNFRSF10C, DcR2/TNFRSF10D, OPG/TNFRSF11B) competitively sequester ligand without signaling | Non-functional splice variants (TRAIL-beta, TRAIL-gamma lack exon 3) inflate mRNA counts; cathepsin E-mediated shedding; apoptosis-induced survival bias in scRNA-seq data |
| LTA | LTA | Obligate heteromeric complex with LTB: the dominant form (LTα1β2) requires LTB co-expression and signals through LTBR, not TNFR1/2 | Mathematical collinearity with TNFA in ridge regression (LTA3 homotrimer binds the same TNFR1/2 receptors as TNF-α); 7 known splice variants; low/transient expression |
| HGF | HGF | Obligate mesenchymal-to-epithelial paracrine topology: HGF produced by fibroblasts/stellate cells, MET receptor on epithelial cells | Secreted as inactive pro-HGF requiring proteolytic cleavage by HGFAC/uPA (post-translational activation is rate-limiting); ECM/heparin sequestration creates stored protein pool invisible to transcriptomics |

Key insight: None of these targets have isoforms or subunits mapping to different gene IDs that would cause gene ID confusion. The poor correlations are driven by post-translational regulation (membrane shedding, proteolytic activation, decoy receptor sequestration), paracrine signaling topology (producer and responder cells are different cell types), and heteromeric complex dependence (LTA requires LTB). These represent fundamental limitations of using ligand mRNA abundance to predict downstream signaling activity — the CytoSig activity scores themselves remain valid readouts of pathway activation in the measured cells.

4.4 Cross-Atlas Consistency

Figure 4. Key cytokine target correlations tracked across 8 independent atlases (CytoSig, donor-level). Lines are colored by cytokine family: Interferon (red), TGF-β (blue), Interleukin (green), TNF (amber), Growth Factor (purple), Chemokine (pink), Colony-Stimulating (indigo).

4.5 Effect of Aggregation Level

Figure 5. Effect of cell-type annotation granularity on validation correlations. CytoSig (43 targets), SecAct (1,249 targets), and SecAct restricted to CytoSig-matched targets (22 shared targets) shown side by side.

Aggregation levels explained: Pseudobulk profiles are aggregated at increasingly fine cell-type resolution. At coarser levels, each pseudobulk profile averages more cells, yielding smoother expression estimates but masking cell-type-specific signals. At finer levels, each profile is more cell-type-specific but based on fewer cells.

| Atlas | Level | Description | N Cell Types |
| --- | --- | --- | --- |
| CIMA | Donor Only | Whole-sample pseudobulk per donor | 1 (all) |
| CIMA | Donor × L1 | Broad lineages (B, CD4_T, CD8_T, Myeloid, NK, etc.) | 7 |
| CIMA | Donor × L2 | Intermediate (CD4_memory, CD8_naive, DC, Mono, etc.) | 28 |
| CIMA | Donor × L3 | Fine-grained (CD4_Tcm, cMono, Switched_Bm, etc.) | 39 |
| CIMA | Donor × L4 | Finest marker-annotated (CD4_Th17-like_RORC, cMono_IL1B, etc.) | 73 |
| Inflammation | Donor Only | Whole-sample pseudobulk per donor | 1 (all) |
| Inflammation | Donor × L1 | Broad categories (B, DC, Mono, T_CD4/CD8 subsets, etc.) | 18 |
| Inflammation | Donor × L2 | Fine-grained (Th1, Th2, Tregs, NK_adaptive, etc.) | 65 |
| scAtlas Normal | Donor × Organ | Per-organ pseudobulk (Bladder, Blood, Breast, Lung, etc.) | 25 organs |
| scAtlas Normal | Donor × Organ × CT1 | Broad cell types within each organ | 191 |
| scAtlas Normal | Donor × Organ × CT2 | Fine cell types within each organ | 356 |
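
The trade-off between aggregation levels can be sketched with a small pseudobulk helper that averages cells sharing the same labels. The function name, the mean aggregation, and the six-cell toy matrix are assumptions for illustration; the source does not state the exact normalization used in the pipeline.

```python
from collections import defaultdict

import numpy as np

def pseudobulk(expr, *labels):
    """Average expression over cells that share the same label tuple.

    expr: (n_cells, n_genes) matrix; labels: one or more per-cell label sequences.
    Returns {label_tuple: (mean_profile, n_cells)}.
    """
    expr = np.asarray(expr, dtype=float)
    groups = defaultdict(list)
    for cell_index, key in enumerate(zip(*labels)):
        groups[key].append(cell_index)
    return {key: (expr[idx].mean(axis=0), len(idx)) for key, idx in groups.items()}

# Toy data (hypothetical): 6 cells, 2 genes.
expr = [[1, 0], [3, 0], [0, 2], [0, 4], [5, 5], [7, 7]]
donor = ["d1", "d1", "d1", "d1", "d2", "d2"]
cell_type = ["B", "B", "T", "T", "B", "B"]

by_donor = pseudobulk(expr, donor)                 # coarse: one profile per donor
by_donor_ct = pseudobulk(expr, donor, cell_type)   # finer: donor x cell type

print(by_donor[("d1",)])
print(by_donor_ct[("d1", "B")], by_donor_ct[("d1", "T")])
```

The donor-level profile for d1 blends two cell types with opposite gene usage, while the donor × cell-type profiles separate them at the cost of fewer cells per profile.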

4.6 Representative Scatter Plots

Figure 6. Donor-level expression vs CytoSig predicted activity.

4.7 Biologically Important Targets Heatmap

Figure 7. Spearman ρ heatmap for biologically important targets across all atlases.

How each correlation value is computed: For each (target, atlas) cell, we compute Spearman rank correlation between predicted cytokine activity (ridge regression z-score) and target gene expression across all donor-level pseudobulk samples. Specifically:

  1. Pseudobulk aggregation: For each atlas, gene expression is aggregated to the donor level (one profile per donor or donor × cell type).
  2. Activity inference: Ridge regression (secactpy.ridge, λ = 5×10⁵) is applied using the signature matrix (CytoSig: 4,881 genes × 43 cytokines; SecAct: 7,919 genes × 1,249 targets) to predict activity z-scores for each pseudobulk sample.
  3. Correlation: Spearman ρ is computed between the predicted activity z-score and the original expression of the target gene across all donor-level samples within that atlas. A positive ρ means higher predicted activity tracks with higher target gene expression.

GTEx, TCGA, CIMA, and the Inflammation Atlas use donor-only pseudobulk; scAtlas uses donor × organ.

4.8 Bulk RNA-seq Validation (GTEx & TCGA)

Figure 8. Bulk RNA-seq validation: targets ranked by Spearman ρ.

5. CytoSig vs LinCytoSig vs SecAct Comparison

5.1 Method Overview

| Property | CytoSig | LinCytoSig | SecAct |
| --- | --- | --- | --- |
| Targets | 43 cytokines | 178 (45 cell types × 1–13 cytokines) | 1,249 secreted proteins |
| Specificity | Global (cell-type agnostic) | Cell-type specific | Global |
| Source | Experimental bulk RNA-seq | CytoSig stratified by cell type (full methodology) | Spatial Moran's I |
| Best for | General cytokine activity | Cell-type-resolved analysis | Broad secretome profiling |

Figure 9. Six-way signature method comparison at matched (cell type, cytokine) pair level. All 6 methods are evaluated on the same set of matched pairs per atlas (identical n). Inflammation atlases excluded (Ensembl gene IDs). For LinCytoSig construction, see LinCytoSig Methodology.

Six methods compared on identical matched pairs:

  1. CytoSig — 43 cytokines, 4,881 curated genes, cell-type agnostic (pooled from all cell types)
  2. LinCytoSig (orig) — cell-type-matched signatures from the CytoSig database, all 19,918 genes
  3. LinCytoSig (gene-filtered) — same cell-type-matched signatures, restricted to CytoSig’s 4,881 curated genes
  4. LinCytoSig (best-bulk) — for each cytokine, select the single best-performing cell-type signature based on GTEx+TCGA bulk RNA-seq correlation (all 19,918 genes)
  5. LinCytoSig (best-bulk+filt) — same best-bulk selection, restricted to CytoSig’s 4,881 genes
  6. SecAct — 1,249 secreted protein signatures (Moran’s I spatial method), shown for the subset of CytoSig-overlapping targets

Key findings: SecAct achieves the highest median ρ across all atlases. CytoSig outperforms the cell-type-matched LinCytoSig (orig) across all three atlases, largely because LinCytoSig signatures have fewer experiments (3–12 vs 50–300+) and more genes (19,918 vs 4,881), amplifying noise. The “best-bulk” selection strategy (selecting one representative cell-type signature per cytokine based on GTEx+TCGA bulk correlation) substantially improves performance, approaching or exceeding CytoSig. Gene filtering helps (orig < filt) consistently, confirming that restricting to CytoSig’s curated 4,881 genes reduces noise.
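
The best-bulk selection and the gene filter can be sketched in a few lines. This is a sketch under stated assumptions: the function names, the dictionary layout, and every number below are hypothetical, and the real selection uses GTEx+TCGA correlations that are not reproduced here.

```python
def select_best_bulk(bulk_rho):
    """For each cytokine, keep the cell-type signature with the highest bulk rho.

    bulk_rho: {(cytokine, cell_type): rho measured on bulk RNA-seq}.
    Returns {cytokine: (cell_type, rho)}.
    """
    best = {}
    for (cytokine, cell_type), rho in bulk_rho.items():
        if cytokine not in best or rho > best[cytokine][1]:
            best[cytokine] = (cell_type, rho)
    return best

def filter_genes(signature, curated_genes):
    """Restrict a {gene: weight} signature to a curated gene set."""
    return {gene: w for gene, w in signature.items() if gene in curated_genes}

# Invented bulk correlations for illustration only.
bulk_rho = {
    ("IFNG", "NK Cell"): 0.42,
    ("IFNG", "Macrophage"): 0.31,
    ("IL6", "Fibroblast"): 0.25,
    ("IL6", "Hepatocyte"): 0.38,
}
best = select_best_bulk(bulk_rho)

# Invented signature weights; the curated set stands in for CytoSig's gene list.
signature = {"STAT1": 1.8, "IRF1": 1.2, "GENE_X": 0.1}
filtered = filter_genes(signature, curated_genes={"STAT1", "IRF1"})

print(best)
print(filtered)
```

Selecting on bulk correlation uses GTEx and TCGA as the selection criterion, so performance on those two datasets is not an independent test of the chosen signatures.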

5.2 When Does LinCytoSig Outperform CytoSig?

Figure 10. Matched target correlation comparison (celltype-level). Points above diagonal = LinCytoSig outperforms CytoSig.

LinCytoSig wins: Basophil (+0.21), NK Cell (+0.19), Dendritic Cell (+0.18)

CytoSig wins: Lymphatic Endothelial (−0.73), Adipocyte (−0.44), Osteocyte (−0.40), PBMC (−0.38)

Recommendation: Use LinCytoSig for cell-type-resolved questions and CytoSig for donor-level questions.

5.3 SecAct: Breadth Over Depth

Figure 12. Top 30 novel SecAct targets with consistent positive correlation. Distribution of all SecAct mean ρ values.

5.4 LinCytoSig Specificity Deep Dive

Figure 11. LinCytoSig advantage by cell type. Basophil, NK Cell, and Dendritic Cell benefit most.
Figure 13. Top 20 cases where LinCytoSig outperforms vs underperforms CytoSig.

6. Key Takeaways for Scientific Discovery

6.1 What CytoAtlas Enables

  1. Quantitative cytokine activity per cell type per disease
  2. Cross-disease comparison — same 44 CytoSig signatures across 20 diseases, 35 organs, 15 cancer types
  3. Independent perturbation comparison — parse_10M provides 90 cytokine perturbations × 12 donors × 18 cell types for independent comparison with CytoSig predictions
  4. Drug-cytokine interaction — Tahoe-100M maps 95 drugs × 50 cancer cell lines
  5. Spatial context — SpatialCorpus-110M maps cytokine activity to spatial neighborhoods

6.2 Limitations

  1. Linear model: Cannot capture non-linear cytokine interactions
  2. Transcriptomics-only: Post-translational regulation invisible
  3. Signature matrix bias: Underrepresented cell types have weaker signatures
  4. Validation metric: Expression-activity correlation underestimates true accuracy

6.3 Future Directions

  1. scGPT cohort integration (~35M cells)
  2. cellxgene Census integration
  3. Drug response prediction models
  4. Spatial cytokine niches
  5. Treatment response biomarkers

7. Appendix: Technical Specifications

A. Computational Infrastructure

B. Statistical Methods